
Data: Handle case where partition location is missing for TableMigrationUtil #12212

Open · wants to merge 3 commits into base: main

Conversation

@jshmchenxi (Contributor) commented Feb 10, 2025

When we use SnapshotTableSparkAction to create an Iceberg table from a Hive table, and the Hive table contains a partition whose location is missing from the file system, the Spark procedure fails with the following exception:

Caused by: java.lang.RuntimeException: Unable to list files in partition: s3://bucket/table/partition=foo
	at org.apache.iceberg.data.TableMigrationUtil.listPartition(TableMigrationUtil.java:206)
	at org.apache.iceberg.spark.SparkTableUtil.listPartition(SparkTableUtil.java:309)
	at org.apache.iceberg.spark.SparkTableUtil.lambda$importSparkPartitions$37333fc7$1(SparkTableUtil.java:767)
	at org.apache.spark.sql.Dataset.$anonfun$flatMap$2(Dataset.scala:3484)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:375)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.io.FileNotFoundException: No such file or directory: s3://bucket/table/partition=foo
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3799)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3650)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerListStatus(S3AFileSystem.java:3373)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$null$22(S3AFileSystem.java:3344)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listStatus$23(S3AFileSystem.java:3343)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2478)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2497)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:3342)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2078)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2122)
	at org.apache.iceberg.data.TableMigrationUtil.listPartition(TableMigrationUtil.java:167)
	... 32 more

This PR handles that case by treating the missing partition location as an empty directory.
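The core of the change is a guard around the directory listing: if the partition location does not exist, return an empty listing instead of letting the list call throw. The pattern can be sketched with plain `java.nio` (the actual patch uses Hadoop's `FileSystem.exists`/`listStatus`; `listPartitionFiles` below is a hypothetical helper for illustration, not the method in the patch):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MissingPartitionDemo {

  // Guard sketch: a missing partition directory yields an empty listing
  // instead of a FileNotFoundException from the list call.
  static List<Path> listPartitionFiles(Path partitionDir) throws IOException {
    if (!Files.exists(partitionDir)) {
      return Collections.emptyList();
    }
    try (Stream<Path> entries = Files.list(partitionDir)) {
      return entries.collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    // A path that does not exist no longer blows up the import.
    System.out.println(listPartitionFiles(Path.of("/definitely/missing/partition=foo"))); // prints "[]"
  }
}
```

With Hadoop's API the same shape applies, only with `fs.exists(partitionDir)` and an empty `FileStatus[]` in place of the empty list.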

@github-actions github-actions bot added the data label Feb 10, 2025
@jshmchenxi jshmchenxi changed the title Handle case where partition location is missing from the file system in TableMigrationUtil Data: Handle case where partition location is missing for TableMigrationUtil Feb 10, 2025
@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch 2 times, most recently from 5b7ff64 to 658010c Compare February 10, 2025 02:31
@manuzhang (Collaborator) commented:

@jshmchenxi Thanks for the fix. Can you add a test?

@RussellSpitzer (Member) left a comment:


Looks good although I agree we need a test to check that this is working as expected.

@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch 2 times, most recently from 36e2b7b to f3c2e11 Compare February 11, 2025 06:52
@jshmchenxi (Contributor, Author) commented:

@manuzhang @RussellSpitzer Thanks for the suggestion! I've added test cases to cover this change.

@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch 2 times, most recently from 60ef66a to 23a1b11 Compare February 11, 2025 06:56
@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch from 23a1b11 to 6b96233 Compare February 11, 2025 07:04
@manuzhang (Collaborator) commented:

@jshmchenxi Can we add an end-to-end test in TestSnapshotTableAction?

@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch from 6b96233 to 2aacf49 Compare February 16, 2025 02:34
@github-actions github-actions bot added the spark label Feb 16, 2025
@jshmchenxi (Contributor, Author) commented:

@jshmchenxi Can we add an end-to-end test in TestSnapshotTableAction?

@manuzhang I've added the end-to-end test. Please take a look.

@jshmchenxi (Contributor, Author) commented:

Kindly ping @manuzhang @RussellSpitzer @stevenzwu

    Arrays.stream(
        fs.exists(partitionDir)
            ? fs.listStatus(partitionDir, HIDDEN_PATH_FILTER)
            : new FileStatus[] {})
A reviewer (Member) commented on this snippet:

Shouldn't we log something here?

@jshmchenxi (Contributor, Author) replied:

Sounds good. I've added a log here.
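The added warning presumably looks something like the sketch below (the method and message are illustrative, not the exact ones in the patch); `java.util.logging` stands in here for the SLF4J logger that Iceberg actually uses, so the block stays self-contained:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Logger;

public class PartitionWarnDemo {

  private static final Logger LOG = Logger.getLogger(PartitionWarnDemo.class.getName());

  // Hypothetical helper: warn when the partition directory is absent,
  // and report whether it was missing so the caller can skip listing it.
  static boolean warnIfMissing(Path partitionDir) {
    if (!Files.exists(partitionDir)) {
      LOG.warning("Partition location " + partitionDir + " does not exist; treating it as empty");
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    warnIfMissing(Path.of("/missing/partition=foo"));
  }
}
```

Logging before falling back to the empty listing keeps the migration resilient without silently hiding the inconsistency between the Hive metastore and the file system.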

@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch from 2aacf49 to 5e58ba9 Compare February 19, 2025 02:43
@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch from 5e58ba9 to e0c9f62 Compare February 21, 2025 01:31
5 participants